Abstract

Although the role of lateral gene transfer is well recognized in the evolution
of bacteria, it is generally assumed that it has had less influence
among eukaryotes. To explore this hypothesis we compare the dynamics of 
genome evolution in two groups of organisms: Cyanobacteria and Fungi. 
Ancestral genomes are inferred in both clades using two types of methods:
first, Count, a gene tree-unaware method that models gene duplications,
gains and losses to explain the observed numbers of genes present in a
genome; second, ALE, a more recent gene tree-aware method that reconciles
gene trees with a species tree using a model of gene duplication, loss,
and transfer. We compare their merits and their ability to quantify the
role of transfers, and assess the impact of taxonomic sampling on their 
inferences. We present what we believe is compelling evidence that gene 
transfer plays a significant role in the evolution of Fungi. 


1 Introduction 

Reconstructing genome evolution and ancestral genomes is instrumental to
understanding the diversification of life on Earth. Doing so requires harnessing
the information available in complete genome sequences, which is best achieved
in a statistical framework. Integrative methods to reconstruct the evolution of
genomes and thus ancestral genomes are now able to model particular histories
of genes inside a general history of genomes and can integrate many different 
types of events. They integrate sequence-level events such as substitutions, 
gene-level events such as duplications (D), losses (L), and exchanges of genes 
between genomes, modeled by lateral gene transfers (hereafter transfers, T), as 
well as genome-level events such as speciations (S). This inclusiveness enables 
them to handle diverse groups of organisms, each with their idiosyncratic way 
of evolving. It therefore becomes possible to apply a single method to groups 
from different domains of life, and compare their modes of evolution. 

Reconstructing ancestral genomes requires at minimum two types of data:
extant genomes with homology relationships between genome fragments, and a
tree along which these genomes are supposed to have evolved. A species tree
modeling vertical descent is indispensable, because without it, we cannot
differentiate vertical inheritance from lateral transfer, and little can be learned about
the processes of genome evolution. 

Using a common species tree does not mean that we assume that all
homologous fragments have had the exact same history. Instead, the history of
each individual homologous fragment is reconstructed, with its own succession
of duplications, losses and transfers. For species that have diverged a long time
ago, only the protein coding portion of the genomes is analyzed, and individual
histories are reconstructed for each gene family. These gene histories are
subsequently analyzed together to gain insight into genome evolution, and infer 
large-scale patterns of gene duplications, losses, or transfers. Both steps, first 
gene tree reconstruction and second aggregation of gene histories into coherent 
patterns, necessitate thoughtful methodologies to overcome possible sources of 
errors and uncertainties. 

1.1 Reconstruction of gene histories 

Gene sequences are often too short to contain sufficient information for accurate 
and robust reconstruction of the history of a gene family; worse, even when 
this information is present, models of sequence evolution may fail to capture 
it correctly. In general, a gene family’s history cannot be reliably inferred,
nor interpreted in terms of gene-level events, from the set of sequences alone.
Using additional information coming from the species tree is a way to
improve gene tree quality (Fig. 1). This is the approach taken by "gene
tree-aware approaches". Alternatively, it is possible to entirely do away with the
sequences and avoid gene tree reconstruction: the gene tree-unaware "gene
content approach" considers only gene presence/absence patterns, or numbers
of genes per species.

1.1.1 Gene content approaches 

Gene content approaches work with data in the form of either presence/absence 
of a gene family inside a given genome, or numbers of genes of a gene family 
inside a given genome (Fig. 1). In both cases, parsimony approaches or
probabilistic models have been used to reconstruct the evolution of gene families
along a species phylogeny. 

Figure 1: How a gene history can be incorrectly reconstructed if the gene tree is
not taken into account, or if taxonomic sampling is incomplete. Left: Inference
according to the gene content approach. Middle: Inference according to a gene
tree-aware approach. Right: Inference according to a gene tree-aware approach,
with a more complete taxonomic sampling. Ignoring gene phylogeny and having
insufficient species sampling lead to underestimation of gene transfer. In all
cases, the same true gene history is assumed, but only with sufficient taxonomic
sampling and with a gene tree-aware approach can it be recovered.

Among parsimony methods, one can choose between Wagner and Dollo
parsimony. Choosing Dollo parsimony amounts to making a strong assumption
about the pattern of gene family evolution, as it means that a gene family can
be gained only once, on a single branch of the species phylogeny. In short,
this means that gene transfers are forbidden. Wagner parsimony can be more
moderate in its assumptions, but still requires that costs be defined for all types
of events involved in gene family evolution, i.e. duplications, transfers, losses.
There is no objective way to set these costs, and users often try a range of costs,
eyeball the results, and choose the costs that produce the evolutionary scenarios
that seem most reasonable [3]. The most systematic approaches use ancestral
genome sizes to pick costs that generate ancestral genomes that are neither too
big nor too small, but still lack a proper statistical framework.
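
To make the dependence on costs concrete, the following is a minimal sketch of
weighted parsimony on a presence/absence profile, solved with the Sankoff
dynamic programming recursion. It is not taken from any of the programs cited
here: the tree, the profile and the cost values are hypothetical, and "gain"
lumps together transfers and originations.

```python
# Hypothetical rooted species tree (nested tuples) and presence/absence profile.
TREE = ((("A", "B"), "C"), ("D", "E"))
PROFILE = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0}


def sankoff(node, profile, gain_cost, loss_cost):
    """Return (min cost if the gene is absent at node, min cost if present)."""
    inf = float("inf")
    if isinstance(node, str):  # leaf: the observed state is the only allowed one
        return (0.0, inf) if profile[node] == 0 else (inf, 0.0)
    cost_absent, cost_present = 0.0, 0.0
    for child in node:
        child_absent, child_present = sankoff(child, profile, gain_cost, loss_cost)
        # Parent absent: child stays absent, or gains the gene on the branch.
        cost_absent += min(child_absent, child_present + gain_cost)
        # Parent present: child stays present, or loses the gene on the branch.
        cost_present += min(child_present, child_absent + loss_cost)
    return (cost_absent, cost_present)


def min_cost(tree, profile, gain_cost, loss_cost):
    """Cheapest scenario over both possible root states (root state is free)."""
    return min(sankoff(tree, profile, gain_cost, loss_cost))


# Equal costs: two independent gains (in A and D) are cheapest.
cheap_gain = min_cost(TREE, PROFILE, gain_cost=1, loss_cost=1)
# Expensive gains: an ancestral gene with three losses (B, C, E) is cheapest.
costly_gain = min_cost(TREE, PROFILE, gain_cost=5, loss_cost=1)
```

With the same data, changing only the gain cost switches the preferred scenario
from two gains to a single ancestral gene followed by three losses, which is why
the choice of costs matters so much in practice.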

Probabilistic approaches either rely on an ad hoc adaptation of substitution
models used to describe sequence evolution, or rely on a birth-death model
that includes rates of gene duplication, transfer, and loss. They can include
corrections for unobserved data, i.e. gene families that are present in none of the
sampled species but that were present in ancestral species. These approaches
do not require arbitrary choices of costs: instead rates are estimated from the 
data. Different models can then be tested against each other, for instance to 
test whether there is significant support for the presence of gene transfer in 
the data. These tests rely on the well-known machinery for model testing, 
and include likelihood ratio tests, Akaike or Bayesian Information Criterion, or 
Bayes factors if inference is performed in a Bayesian setting. 
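
As an illustration of this model-testing machinery, the sketch below compares a
model without transfer to a nested model with one extra transfer-rate
parameter. The log-likelihood values are hypothetical, and the p-value uses the
standard chi-square approximation with one degree of freedom; because a rate
of zero lies on the boundary of the parameter space, that approximation is
expected to be conservative.

```python
import math


def likelihood_ratio_test_1df(lnl_null, lnl_alt):
    """LRT for nested models differing by one parameter.

    Returns the test statistic and the chi-square (1 d.f.) p-value, using
    the identity P(X > x) = erfc(sqrt(x / 2)) for X ~ chi-square(1).
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


def aic(lnl, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * lnl


def bic(lnl, n_params, n_obs):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * lnl


# Hypothetical log-likelihoods: duplication-loss (2 rates) vs
# duplication-transfer-loss (3 rates) fitted to the same data.
stat, p_value = likelihood_ratio_test_1df(lnl_null=-1000.0, lnl_alt=-995.0)
prefers_transfer = aic(-995.0, 3) < aic(-1000.0, 2)
```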

Whether they are analyzed by parsimony or probabilistic approaches, gene 
content data are limited in their ability to detect events of gene family evolution. 
Even the approaches that use the numbers of genes and not just their pattern of 
presence/absence will make mistakes that approaches based on the consideration 
of gene tree topologies could avoid, if the gene trees are accurately reconstructed 
(see middle graph in Fig. 1).


1.1.2 Gene tree-aware approaches 

Most gene families share parts of their histories, i.e. have been inherited together 
from ancestors to descendants during parts of their history (Fig. 1). If we can
reconstruct the parts of their histories where genes have co-evolved, then jointly 
reconstructing gene histories can be very helpful, because more information is 
available to reconstruct each gene history. In cases where there is no gene 
transfer, then all genes share a common pattern of descent along the species 
tree. When genes can be transferred, they may share only part of their history 
with other gene families. 

Gene tree-aware approaches were first used to deal with incomplete lineage 
sorting (ILS) through the multispecies coalescent [9]. In that framework, all 
gene families have evolved within the boundaries of the species history, and 
heterogeneities among gene histories originate from population-level sorting of 
alleles only. More recently, similar models have been proposed to deal with 
other processes of genome evolution, namely gene duplication, transfer, and 
loss (DTL). For an in-depth review, please see pQ. With these models, gene 
families can have a wider array of histories, and can differ drastically from 
the species tree. Invariably, whether they deal with ILS or DTL events, gene 
tree-species tree models have been found to produce gene trees that are more 
accurate than competing approaches. This is expected: as more information is 
used to reconstruct gene trees, stochastic error should diminish. 

Much like gene content approaches, gene tree-aware approaches can be based
on probabilistic models that include parameters for DTL events,
or on parsimonious models, in which case DTL events are associated with costs.
Gene tree-species tree approaches, however, are computationally
challenging. Interpreting a gene tree in the light of a species tree by placing
events of gene duplication, transfer and loss, a process called reconciling a gene
tree, is not difficult provided that rates or costs of events are known. Things
get more complicated when the gene tree is not assumed to be known, and
needs to be reconstructed. Naturally, if the species tree itself also needs to be
reconstructed, then the task becomes extremely difficult; however in the rest of
this article we will assume the species tree is known without uncertainty.
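
To illustrate what reconciling a known gene tree involves, here is a minimal
sketch of the classic last-common-ancestor (LCA) mapping, restricted to
duplications and losses. It is a simplification chosen for brevity and is not
the DTL model used in this article; the trees and gene names are hypothetical.

```python
def species_clades(tree, out):
    """Collect every clade of a rooted species tree (nested tuples)."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(species_clades(c, out) for c in tree))
    out.append(clade)
    return clade


def lca(clades, species):
    """Smallest species-tree clade that contains all of `species`."""
    return min((c for c in clades if species <= c), key=len)


def reconcile_dl(gene_tree, clades, species_of):
    """Return (species clade the node maps to, duplications at or below it)."""
    if isinstance(gene_tree, str):
        return lca(clades, frozenset([species_of[gene_tree]])), 0
    results = [reconcile_dl(child, clades, species_of) for child in gene_tree]
    child_maps = [mapped for mapped, _ in results]
    here = lca(clades, frozenset().union(*child_maps))
    duplications = sum(count for _, count in results)
    # A node mapping to the same species clade as one of its children
    # must be a duplication under the LCA mapping.
    if here in child_maps:
        duplications += 1
    return here, duplications


# Hypothetical example: species tree ((A,B),C) and two gene trees.
CLADES = []
species_clades((("A", "B"), "C"), CLADES)
SPECIES_OF = {"a1": "A", "b1": "B", "c1": "C"}
congruent = reconcile_dl((("a1", "b1"), "c1"), CLADES, SPECIES_OF)
discordant = reconcile_dl((("a1", "c1"), "b1"), CLADES, SPECIES_OF)
```

The discordant gene tree requires one duplication at the root (plus losses)
under this duplication-loss view; a model that allows transfer could explain
the same topology with a single transfer instead.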

Methods to reconstruct gene trees using gene tree-aware approaches can use
tree exploration heuristics similar to those found in commonly-used programs
for phylogenetic tree reconstruction, as in Phyldog or
in DLRS [22]. These approaches, however, tend to be slow, which motivated
other approaches based on the consideration of a set of candidate gene trees
obtained using faster approaches that do not consider a species tree. These
approaches include TreefixDTL [15], ALE [23, 24] and TERA. The latter
two approaches are extensions of an idea initially proposed in [5] and formalized
in [23] and are particularly fast and accurate. They are based on the
"amalgamation" idea. Based on a sample of gene trees, amalgamation is a dynamic
programming algorithm that allows the exhaustive exploration of a large space
of gene trees. In fact, based on a limited set of gene trees, amalgamation allows
considering a much larger space of gene trees, because it can piece together
clades from several trees at a time to generate new trees, not present in the
initial sample of gene trees. This technique is found to improve on competing
approaches in both speed and accuracy.
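
The following sketch illustrates why amalgamation reaches trees outside the
initial sample. It only counts the rooted trees that can be assembled from the
clade splits observed in a sample; it is a simplified illustration under our
own assumptions (rooted binary trees as nested tuples, no clade weights), not
the ALE implementation, which works with conditional clade probabilities and
couples this recursion to the reconciliation model.

```python
from collections import defaultdict


def clade_splits(tree, splits):
    """Record each observed split clade -> {left, right}; return the clade."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = (clade_splits(child, splits) for child in tree)
    clade = left | right
    splits[clade].add(frozenset([left, right]))
    return clade


def count_amalgamations(clade, splits, memo):
    """Number of trees on `clade` that use only observed splits."""
    if len(clade) == 1:
        return 1
    if clade not in memo:
        total = 0
        for split in splits[clade]:
            left, right = tuple(split)
            total += (count_amalgamations(left, splits, memo)
                      * count_amalgamations(right, splits, memo))
        memo[clade] = total
    return memo[clade]


def amalgamable_trees(sample):
    """Count the trees reachable by amalgamating a sample of gene trees."""
    splits = defaultdict(set)
    roots = {clade_splits(tree, splits) for tree in sample}
    assert len(roots) == 1, "all trees must share the same leaf set"
    return count_amalgamations(roots.pop(), splits, {})


# Hypothetical sample of two gene trees on six genes.
SAMPLE = [
    ((("A", "B"), "C"), (("D", "E"), "F")),
    (("A", ("B", "C")), ("D", ("E", "F"))),
]
n_trees = amalgamable_trees(SAMPLE)
```

Here two sampled trees give access to four trees, because the two resolutions
of clade {A, B, C} can be combined freely with the two resolutions of clade
{D, E, F}.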

Probabilistic gene tree-aware approaches can also be used to date trees. In
such cases, gene tree-aware models often reconstruct ultrametric gene trees,
a model describing the rate of sequence evolution needs to be used, and an
ultrametric species tree whose nodes are anchored in time is required.
Although these models contain additional parameters that need
to be estimated, and are therefore computationally more complex to handle,
they provide the ability to date events of gene family evolution along with
the ability to estimate rates of events. Rates of events can then be compared
across clades, although Fig. 1 is here to remind us that taxonomic sampling
can have a non-trivial impact on the rates of reconstructed events. Another set
of approaches avoids modelling the rate of sequence evolution and yet anchors
events in time. These approaches use a rooted ultrametric species tree
in which nodes are ordered relative to each other, and mandate that transfers
occur only between contemporaneous lineages. Gene trees however do not need
to be ultrametric, which makes it possible to avoid using a model describing
the rate of sequence evolution. Whether they use models describing the rate of
sequence evolution or not, models that use ultrametric species trees are more
realistic than models in which the nodes of the species tree are not ordered,
because they include the constraint that only contemporaneous lineages can
exchange genes; however, this constraint comes at a high computational cost.
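
The contemporaneity constraint can be stated very compactly: two branches of a
dated species tree may exchange a gene only if their time intervals overlap.
The sketch below encodes this check; the branch names and ages are
hypothetical and serve only to illustrate the rule.

```python
# Each branch is (age of its older end, age of its younger end), in arbitrary
# time units before present. All values are hypothetical.
BRANCHES = {
    "ancestor_AB": (10.0, 6.0),
    "A": (6.0, 0.0),
    "B": (6.0, 0.0),
    "C": (10.0, 0.0),
}


def contemporaneous(donor, recipient, branches=BRANCHES):
    """True if the two branches coexist during some interval of time."""
    donor_old, donor_young = branches[donor]
    recipient_old, recipient_young = branches[recipient]
    return min(donor_old, recipient_old) > max(donor_young, recipient_young)


# The ancestor of A and B coexisted with lineage C, so a transfer is allowed.
allowed = contemporaneous("ancestor_AB", "C")
# An ancestor and its own descendant never coexist, so this is forbidden.
forbidden = contemporaneous("ancestor_AB", "A")
```

Models that work with undated species trees skip this check, which is what
makes them cheaper to compute.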

1.1.3 The impact of incomplete taxonomic sampling 

No matter how complex our models of genome evolution, our inferences depend 
on the sampling of our data set (Fig. 1). Although progress in sequencing
methods is moving at a fast pace, and genome sequences keep accumulating in 
databases, we will always be missing a clade or species that will prevent our 
data sets from being complete. It is unclear how such missing data impacts our 
inferences. Fig. 1 shows that missing species can cause transfer events to be
incorrectly interpreted as duplication events, both for gene content and gene tree-aware
approaches, but the magnitude of this effect is unknown. Worse, if our sampling
of a clade misses a group of species with idiosyncratic characteristics (e.g. larger 
genomes, larger rates of gene transfer...), then our estimate of the parameters 
of genome evolution for this group will be biased. In the hope of achieving an 
unbiased estimate of genome evolution, it is important to try to quantify the 
bias imposed by incomplete taxonomic sampling. 

1.1.4 Comparing gene tree-aware and unaware approaches 

Although reconstructing genome evolution is a widely pursued endeavour, there
have been few assessments of the inference methods used to reconstruct gene
histories along the species tree. In this article we compare gene-content approaches
with gene tree-aware approaches by using publicly-available software on two
well-known clades in the tree of life. We use a state-of-the-art probabilistic
gene-content approach, Count, and a probabilistic gene tree-aware approach,
ALEml_undated (available at https://github.com/ssolo/ALE), adapted to handle
undated species trees. We address the impact of incomplete taxonomic sampling
by performing rarefaction studies, whereby species are pruned from our
species trees and DTL rates are compared across samples. Our primary aim is
to focus on the inferences of the two methods and explain their differences in
the light of their strengths and shortcomings. In the process, we will contrast
genome evolution in Cyanobacteria and Fungi.
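
A rarefaction replicate amounts to drawing a random subset of species and
pruning the tree down to it. The sketch below shows one way this could be
done on a tree stored as nested tuples; the function names, the tree and the
sampling scheme are our own assumptions and are not taken from the pipeline
used in this study.

```python
import random


def prune(tree, keep):
    """Restrict a rooted tree (nested tuples) to the leaves in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not children:
        return None
    if len(children) == 1:  # suppress nodes left with a single child
        return children[0]
    return tuple(children)


def rarefy(tree, species, n_keep, seed):
    """Draw `n_keep` species at random and return (pruned tree, kept set)."""
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(species), n_keep))
    return prune(tree, keep), keep


# Hypothetical four-species tree; species B is pruned away.
EXAMPLE_TREE = (("A", "B"), ("C", "D"))
pruned = prune(EXAMPLE_TREE, {"A", "C", "D"})
replicate, kept = rarefy(EXAMPLE_TREE, ["A", "B", "C", "D"], n_keep=3, seed=1)
```

The same pruning would be applied to every gene tree before rates are
re-estimated on each replicate.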


1.2 Genome evolution in Fungi and Cyanobacteria 

Fungi and Cyanobacteria a priori differ in the way their genomes have evolved. 
For instance Fungi undergo whole genome duplications, whereas such events 
have not been reported in Cyanobacteria. While gene transfer has been claimed 
to occur in both Cyanobacteria and Fungi, it is unclear how frequent this process 
has been in these two clades. Another question of interest concerns highways of 
gene transfers, i.e. pairs of branches or clades that appear to have undergone 
a high amount of gene transfers. While several highways of gene transfers have 
been claimed to exist in Bacteria, including in Cyanobacteria [29], it is unknown
whether there are highways of gene transfers in Fungi as well. 

Both Cyanobacteria and Fungi have been the focus of several studies
addressing genome evolution, because they display a wide variety in cell types and
genome sizes, and because they have had an important environmental impact 
throughout their history. In the context of this article, these clades constitute 
excellent case studies to assess the behaviour of gene content and gene tree- 
aware approaches because of their wide diversity in genome size, along with 
the fact that different evolutionary dynamics are expected in Eukaryotes and 
Bacteria. 

1.2.1 Genome evolution in Fungi 

Fungi are characterized by two life forms: one, yeast-like, is unicellular. The
other is multicellular and includes fungi with macroscopic fruiting bodies as well
as filamentous fungi. In this study, we focus on the clade Dikarya, a subkingdom
of fungi that accounts for roughly 98% of described species. This clade is
composed of two well-characterized phyla, Basidiomycota and Ascomycota. We
use the genome sequences included in the HOGENOM database [30]. These
two clades display a wide variety in genome sizes (from 5,200 to 10,000 protein
coding genes, approximately), and have a phylogeny that can be unambiguously
rooted between Basidiomycota and Ascomycota. Studies of genome evolution
in these clades have focused for instance on the impact of whole genome
duplications [31], on the evolution of the yeast (unicellular) form [32], or on the
evolution of pathways for the decomposition of plant material [33, 34]. Recently,
there have been reports of noticeable amounts of gene transfers in Fungi
[35]. In particular several examples indicate that the Aspergillus genome
has been "sculpted by gene transfer" [35]. This is consistent with reports that
lateral gene transfers have been important throughout eukaryotic evolution [39].

1.2.2 Genome evolution in Cyanobacteria 

Cyanobacteria include unicellular organisms as well as organisms with
two cell types, or that organize in filaments, which makes them unique among
Prokaryotes for their ability to leave a recognizable trace in the fossil record
[40]. They display a wide range in genome size (from 1,200 to 4,500 protein
coding genes, approximately), and have had a lasting impact on the Earth with
the release of massive amounts of oxygen in the atmosphere billions of years
ago [41]. From a phylogenomics perspective cyanobacterial genomes share a
relatively large core genome that allows the reconstruction of a well supported
species phylogeny despite the antiquity of the phylum. Cyanobacteria have also
served as a model system for investigating horizontal gene transfer [10], and
have been reported to display highways of gene transfers [29].

2 Methods 

2.1 Data set construction 

2.1.1 Fungi 

First, we selected all the species belonging to Fungi in the HOGENOM6 database [30],
yielding 32 species. We retrieved the protein sequences clustered into homologous
gene families (21701 families, discarding the very large families HOG100000000,
HOG200000000 and HOG300000000, for which no alignment is available in the
database). We discarded 8662 families containing only 2 or 1 genes from Fungi.
Gene trees were constructed for 1791 families containing only three genes (triplets),
for which a single topology is possible. We aligned all families with 4 sequences
or more using MUSCLE [42] with default parameters and selected reliably
aligned sites using GBLOCKS [43]. The parameters employed were "minimum
number of sequences for a conserved position" b1 = 50, "minimum number of
sequences for a flank position" b2 = 50 and "allowed gap positions" b5 = a (all).

To estimate computing time per family, we measured the time PhyloBayes took
to compute 10 trees based on each alignment. We discarded the slowest
decile of families. For each remaining alignment we ran 2 chains using
PhyloBayes, calculating 5500 gene trees (discarding the first 500 as burn-in), using
the LG model of evolution [44]. In the end we were able to compute at least
one chain for 9596 gene families. Combined with the 1791 triplets, our data set
contains in total 11387 gene families, totaling 135346 genes, while 24327 genes
were discarded during our selection process (not counting the three HOGENOM
families without alignments).

Given that the tree of Fungi is still unresolved, with Microsporidia branching
in an undefined place, we decided to use a smaller dataset, comprising the
clade of Dikarya (28 species). This has the advantage that this clade can be
easily rooted between Ascomycota and Basidiomycota. We pruned the gene
trees, removing from them two species of Microsporidia as well as Allomyces
macrogynus and Spizellomyces punctatus, which belong to other basal clades of
Fungi. In total, we used 11295 gene families. The gene trees are well resolved
with an average posterior support of 0.97 (median = 1).

Due to the uncertain position of Aspergillus nidulans, we relied on two species 
trees: one reconstructed from a concatenate, and one drawn from the literature. 

For our first tree, which we call tree A, we used a concatenate of 529 near
universal single-copy gene family alignments (25 or more species represented out of
28). In total the alignment contained 221 127 amino-acid sites including 24 514
without missing data. Both PhyML using the LG model of evolution [44]
and a Gamma distribution to account for rate variation [45] and Phylobayes
[21] using the CAT model with Poisson exchangeabilities [46] recovered the
same topology. We rooted the species tree between Ascomycota and
Basidiomycota. The resulting phylogeny identifies the major clades, Pezizomycotina,
which groups Neurospora crassa and Aspergillus fungi, and Saccharomycotina,
which notably groups Yarrowia lipolytica, Candida species, and Saccharomyces.
Our phylogeny is in agreement with the phylogenies of m , but in their study
the position of Aspergillus nidulans changes depending on the method: a
concatenate based on 153 universal genes places it next to Aspergillus fumigatus,
as we do, but supertree methods find it at the base of the Aspergillus clade. To 
account for these discrepancies, we reconstructed a second species tree where 
Aspergillus nidulans is at the base of the Aspergillus clade. We call this tree B, 
and estimated its branch lengths using PhyML with the same model as above. 
Tree A and Tree B can be found in the supplementary material. 

2.1.2 Cyanobacteria 

We selected all the species belonging to Cyanobacteria in the HOGENOM6
database [30], yielding 40 species. We reconstructed an unrooted species
phylogeny using PhyML with the LG model of evolution [44] and a gamma
distribution to account for rate variation [45] from a concatenate of 470 near
universal single-copy gene family alignments (38 or more species represented
out of 40). In total the alignment contained 126 180 amino-acid sites including
67 646 without missing data. The resulting tree agreed with our genome-scale
reconstruction and other previous phylogenomic results (see discussion in
[10], available in the supplementary material). We rooted the species tree
according to [10]. For 7415 gene families with 3 or more genes we employed the
alignment procedure and sampling procedure described above. The gene trees
are well resolved with an average posterior support of 0.96 (median = 1).

2.2 Inference methods 

2.2.1 Count 

Count is a software package for performing studies in gene family evolution. 
It can perform ancestral genome reconstruction by posterior probabilities in a 
phylogenetic birth-and-death model [28]. Rates were optimized using a Gain- 
loss-duplication model, with default parameters and allowing different gain- 
loss and duplication-loss rates for different branches. One hundred rounds of 
optimization were computed. 

2.2.2 ALEml_undated

ALEml_undated implements a probabilistic approach to exhaustively explore
all reconciled gene trees that can be amalgamated as a combination of clades
observed in a sample of gene trees [24] in the context of different species tree-gene
tree reconciliation models, in particular the model described in [23], which allows
for the duplication, transfer and loss of genes. ALE can be used to efficiently
approximate the sum of the joint likelihood over amalgamations and to find the
reconciled gene tree that maximizes the joint likelihood among all such trees
or sample the space of possible reconciliations. Here, we use two reconciliation
methods, a simplified Duplication, Transfer and Loss (DTL) approach that does
not consider the temporal information from the species tree, and a version of this
model that only allows Duplication and Loss (DL). These methods are available
as part of the open-source ALE project (https://github.com/ssolo/ALE).


2.3 Analyses 

Highways were identified between pairs of species that exchange large numbers 
of genes. The number of genes exchanged was averaged over 100 reconciliations 
drawn from ALEml_undated using the program ALEsample, and summed across
all gene families in our datasets. 
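
The aggregation described above can be sketched as follows. The input layout
(one list of sampled reconciliations per family, each reconciliation being a
list of donor/recipient pairs) is a hypothetical representation chosen for the
example and is not the output format of ALE; treating a pair of species as
unordered is likewise our assumption.

```python
from collections import defaultdict


def transfer_highways(families, top=3):
    """Rank species pairs by expected number of transfers.

    `families` is a list of gene families; each family is a list of sampled
    reconciliations; each reconciliation is a list of (donor, recipient).
    Counts are averaged over the samples of a family, then summed over
    families.
    """
    totals = defaultdict(float)
    for samples in families:
        if not samples:
            continue
        weight = 1.0 / len(samples)
        for reconciliation in samples:
            for donor, recipient in reconciliation:
                pair = tuple(sorted((donor, recipient)))  # unordered pair
                totals[pair] += weight
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Hypothetical input: two families with two sampled reconciliations each.
FAMILIES = [
    [[("X", "Y")], [("Y", "X")]],
    [[("X", "Y"), ("X", "Z")], []],
]
highways = transfer_highways(FAMILIES)
```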

Synteny information was extracted from gene positions in the genomes.
Pairwise comparisons of genomes between species were performed. Synteny was
found to be conserved if a gene had as a neighbor a gene whose ortholog was 
also its own ortholog’s neighbor. For simplicity, only gene families with one 
gene per species were considered in the synteny analyses. For a given pairwise 
comparison, a gene was declared as non-transferred if, along the path between 
two species, no transfer event had affected the gene of each species, and declared 
as transferred otherwise. 
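
The synteny criterion can be sketched for single-copy families as follows.
Each genome is represented as an ordered list of family identifiers, so that
orthologs share the same identifier; this representation, the restriction to
immediate neighbours on a single linear sequence, and all names are
assumptions made for the illustration.

```python
def neighbours(gene_order, index):
    """Families immediately upstream and downstream of position `index`."""
    return {
        gene_order[i]
        for i in (index - 1, index + 1)
        if 0 <= i < len(gene_order)
    }


def synteny_conserved(family, genome_a, genome_b):
    """True if the gene has a neighbour whose ortholog neighbours its ortholog."""
    index_a = genome_a.index(family)
    index_b = genome_b.index(family)
    shared = neighbours(genome_a, index_a) & neighbours(genome_b, index_b)
    return bool(shared)


# Hypothetical gene orders for two species (one gene per family).
GENOME_A = ["f1", "f2", "f3", "f4"]
GENOME_B = ["f3", "f2", "f9", "f4"]
f2_conserved = synteny_conserved("f2", GENOME_A, GENOME_B)  # shares f3
f4_conserved = synteny_conserved("f4", GENOME_A, GENOME_B)  # no shared neighbour
```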


3 Results 

3.1 General patterns of genome evolution 

Figs. 2 and 3 show the reconstruction of genome evolution across Fungi and
Cyanobacteria, respectively, using both Count and ALEml_undated. Although
Count and ALEml_undated differ in their input data and in the types of events
they can detect, their inferences are qualitatively similar, finding comparable 
genome size dynamics, and proportions of events on branches. 

In both Cyanobacteria and Fungi, with both methods, a clade with large
genomes (multicellular Aspergillus clade of molds and the clade including
freshwater and multicellular cyanobacteria such as Nostoc and Cyanothece) and a
clade with smaller genomes (unicellular clade of yeasts including Saccharomyces
and Candida and unicellular planktonic cyanobacteria including Prochlorococcus
and Synechococcus) can be observed.

In Fungi, the clade with large genomes (fungi from the multicellular
Aspergillus clade of molds) shows a large portion of gene transfers on several of its
branches, whereas gene transfers appear much less prevalent in the clade with
small genomes (containing the unicellular yeasts). This result confirms earlier
reports based on smaller data sets of larger amounts of gene transfers in the
Aspergillus clade than in the yeast clade. Several branches show an excess
of gene duplications compared to gene transfers. Although a whole genome
duplication (WGD) occurred in the ancestor of Saccharomyces cerevisiae and
Candida glabrata, both models fail to pick up an increased amount of gene
duplications on the relevant branch. They recover an increased amount of
duplications on the branch leading to Saccharomyces cerevisiae alone, possibly
because Candida glabrata has lost a large number of genes, which, in the
absence of synteny (which was used by [31] to detect the WGD), may have erased
a large part of the signal supporting the whole genome duplication. The
ancestor of all Dikarya is predicted to have a very small genome, which is likely the
consequence of our unbalanced taxonomic sampling, with only 3 Ascomycota.
Due to this design, families present only in 2 Ascomycota have been discarded
from our data set, and therefore cannot be inferred at the root.

In Cyanobacteria, both the clades with large and small genomes appear to 
have similar genome dynamics, with more gene transfers than gene duplications.
The ancestor of Cyanobacteria is predicted to have intermediate genome content
in between that of the clade with small genomes (containing Prochlorococcus 
species) and that of the clade with larger genomes. 

3.2 Gene tree-aware approaches are more sensitive than 
gene content approaches 

By design, gene tree-aware approaches can detect more events than gene-content
approaches (Fig. 1). Consistently, ALEml_undated finds significantly more
transfers than Count, with ALEml_undated finding an average of 0.16 and
0.07 transfers per gene in Cyanobacteria and Fungi respectively, in contrast to
Count, which finds 0.14 and 0.06. It is difficult to determine how many of the
additional transfers are due to ALEml_undated finding true transfer events that
Count failed to detect and how many result from errors in reconstructed gene
trees. Simulations do indicate that ALE recovers an unbiased estimate of the
number of transfers [23], and in the case of Cyanobacteria reduces the number
of inferred transfers by approximately two-thirds compared to gene trees
reconstructed without the species tree (by PhyML). Furthermore, Fig. 4A
shows that for cyanobacterial families represented in 8 or fewer genomes the
ALEml_undated and Count estimates closely agree. Regardless of potential
overestimation of the number of transfers, Fig. 4A also highlights that the number
of transfer events per gene family is more-or-less homogeneous with respect
to the number of species represented in the gene family for ALEml_undated,
but systematically decreases for Count as more complete taxonomic
distributions are approached. We see no biological reason why e.g. families with a
complete taxonomic distribution would undergo far fewer transfers than
families with slightly incomplete taxonomic sampling. Instead, we believe
this effect results from a shortcoming of gene tree-unaware methods, such as
Count, whereby they are not able to infer transfer among families with complete
taxonomic distribution, and progressively lose signal as complete taxonomic
distribution is approached.

3.3 Duplication and loss methods systematically overestimate
ancestral gene content

The effect of gene transfers on gene phylogenies can be mimicked by a combi¬ 
nation of gene duplications and losses. Therefore gene duplications and losses 
may be sufficient to account for genome dynamics in our two clades, and it is 
legitimate to ask about the need to incorporate transfers. As shown in Fig. [4)3 
comparison of the gene content of extant genomes and reconstructions based on 
gene-tree aware reconstruction that considers transfer (DTL) shows that these 
methods reconstruct ancestral gene contents that are similar to those observed 
for extant genomes. In stark contrast, gene-tree aware reconstructions that 
only account for duplications and losses, but do not consider transfer (DL), in¬ 
fer systematically more genes to have been present in ancestral genomes than 
in extant ones. The largest ancestral genomes are inferred by DL methods in 
the common ancestors of clades where DTL methods predict the most transfers: 
the deepest node in the Aspergillus genus has ancestral gene contents of 14244 
genes according to the DL estimate, in comparison to 8238 genes for the
DTL-based estimate. The extant gene content in the genus in our sample is
between 7797 (Aspergillus fumigatus A1163) and 8891 (Aspergillus terreus).

Figure 2: Genome evolution in Ascomycota and Basidiomycota (tree A). Edges
are color-coded according to the inferred numbers of losses along the branches.
Crimson bars represent numbers of gene gains (transfers + originations) arriving
on the branch; taupe bars represent numbers of duplications happening on the
branch. At each node, genome content size is represented as a green disk.
Top: inferences from ALEml_undated. Bottom: inferences from Count. The
corresponding graph for tree B is available in the supplementary material.

Figure 3: Genome evolution in Cyanobacteria. Edges are color-coded according
to the inferred numbers of losses along the branches. Crimson bars represent
numbers of gene gains (transfers + originations) arriving on the branch;
taupe bars represent numbers of duplications happening on the branch. At each
node, genome content size is represented as a green disk. Top: inferences from
ALEml_undated. Bottom: inferences from Count.

Figure 4: Comparison between alternative genome evolution reconstruction
methods. (A) Comparing methods that consider transfers alongside duplications
and losses, we find that the gene-tree aware method (ALEml_undated, circles)
infers an order of magnitude more transfers for Fungi. The gene-tree unaware
method (Count, squares) loses its ability to detect transfers as gene families
become universal or near universal (28 species for Fungi, 40 for
Cyanobacteria), while the gene-tree aware method does not. (B) To contrast
gene-tree aware reconstruction that considers transfer (DTL) with one that
only accounts for duplications and losses, but does not consider transfer
(DL), we compare the number of genes in ancestral genomes to the number of
genes in extant genomes. The DTL method infers ancestral genome sizes that
fall within the distribution for extant genomes (p > 0.5, two-sided Wilcoxon
test) for both Fungi and Cyanobacteria. In contrast, gene numbers based on the
DL method are systematically larger than extant ones (p < 10^-3 for
Cyanobacteria and p = 0.02 for Fungi with a one-sided Wilcoxon test, for both
tree A and B). Fungi are in dark grey (tree A red, tree B orange online) and
Cyanobacteria in light grey (green online) throughout.
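
The comparison between ancestral and extant gene counts described above can be illustrated with a small rank-based test. The following is only a sketch, not the procedure used in this study: the function name is hypothetical, the p-value relies on a normal approximation without tie or continuity correction (an exact test would be preferable for samples this small), and apart from 7797, 8891 and 14244 the gene counts are placeholder values chosen for illustration.

```python
import math


def one_sided_rank_sum(x, y):
    """Test whether values in x tend to be larger than values in y.

    Returns (U, p). U counts the pairs (xi, yj) with xi > yj, ties counting
    one half. p is the upper-tail p-value under a normal approximation,
    without tie or continuity correction (an assumption of this sketch).
    """
    n, m = len(x), len(y)
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    mean_u = n * m / 2.0
    sd_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_upper = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p_upper


# Placeholder gene counts: only 7797, 8891 and 14244 appear in the text.
extant = [7797, 8100, 8500, 8891]
ancestral_dl = [10200, 11000, 12500, 14244]

u, p = one_sided_rank_sum(ancestral_dl, extant)
print(f"U = {u:.1f}, one-sided p = {p:.4f}")
```

A small p-value here would indicate that the ancestral counts are shifted upwards relative to the extant ones, which is the pattern reported for the DL reconstructions.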

3.4 Rates of transfers are similar in Fungi and Cyanobacteria

Rates of transfer in Fungi and in Cyanobacteria appear to be very similar, as
shown by the ALEml_undated inferences (Figure 5A). This finding does not
come from differences in the age of the clades, as we compare ratios of numbers
of events, which controls for age. It does not appear to come from incomplete
sampling either, as Figure 5B shows that predictions based on subsampling the
species in each data set still converge to similar ratios of numbers of events
for Fungi and Cyanobacteria. To extrapolate the T/(T + D) values we fit an ad
hoc curve that reaches saturation exponentially, starting from an initial value
for 0 species. Using all subsampled replicates, a least-squares
Marquardt-Levenberg algorithm yielded similar asymptotic values of T/(T + D):
0.8 ± 0.1 (Fungi assuming tree A), 0.7 ± 0.03 (Fungi assuming tree B) and
0.74 ± 0.01 (Cyanobacteria). The same procedure for L/(T + D + L) produced
slightly higher asymptotic values for Fungi, 0.582 ± 0.01 for tree A and
0.595 ± 0.01 for tree B, compared to 0.52 ± 0.01 for Cyanobacteria.
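
The extrapolation described above can be sketched as follows. This is an illustration under stated assumptions, not the fitting code used here. The functional form y(n) = a + b exp(-n / k) is one possible reading of "a curve that reaches saturation exponentially", where a is the asymptote and a + b the value at 0 species. Instead of a Marquardt-Levenberg optimizer, the sketch scans a grid of scale values k and solves the remaining linear least-squares problem in closed form. All function names are hypothetical and the data points are synthetic.

```python
import math


def transfer_fraction(transfers, duplications):
    """T / (T + D): share of transfers among gene birth events."""
    births = transfers + duplications
    if births == 0:
        raise ValueError("no duplication or transfer events")
    return transfers / births


def fit_saturation(ns, ys, scales):
    """Least-squares fit of y = a + b * exp(-n / k).

    For each candidate scale k the model is linear in (a, b), so a and b
    follow from ordinary linear regression on x = exp(-n / k). The k with
    the smallest sum of squared errors is kept.
    """
    best = None
    for k in scales:
        xs = [math.exp(-n / k) for n in ns]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # degenerate predictor, skip this scale
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a = my - b * mx
        sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, k)
    if best is None:
        raise ValueError("no usable scale in the grid")
    sse, a, b, k = best
    return {"asymptote": a, "initial": a + b, "scale": k, "sse": sse}


# Synthetic ratios for subsamples of n species (not data from this study),
# generated with asymptote 0.75, initial value 0.30 and scale 5.
ns = [4, 8, 12, 16, 20, 24, 28]
ys = [0.75 - 0.45 * math.exp(-n / 5.0) for n in ns]

fit = fit_saturation(ns, ys, [0.5 * i for i in range(1, 101)])
print(fit["asymptote"], fit["initial"], fit["scale"])
```

In practice each y value would be a ratio such as `transfer_fraction(T, D)` computed on one subsampled replicate, and the fitted asymptote would be the extrapolated value for complete sampling.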

These genome-wide inferences confirm earlier reports based on manual analyses
of smaller data sets that significant numbers of transfers occurred in Fungi,
in particular in the Aspergillus clade [55115511571155]. Overall, these data
show that genomes in Prokaryotes and Eukaryotes are not undergoing
fundamentally different dynamics. We consider that additional analyses of data
sets for different clades of both Prokaryotes and Eukaryotes, using gene
tree-aware approaches as in this work, would provide a more fine-grained,
quantitative view of the dynamics of genome evolution across the entire tree
of life.

3.5 There are highways of gene transfer in Fungi 

Figure 5: Similar rate of transfers in Fungi and Cyanobacteria. The
cyanobacterial and fungal datasets we considered differ both in the number of
genomes (40 vs. 28) and in their age (nearly 3 billion years vs. less than three
quarters of a billion years). (A) In order to account for differences in age we
consider the ratio T/(T + D), i.e. the fraction of transfer events among all
gene birth events (duplications plus transfers), and L/(T + D + L), the ratio
of loss events to all events. (B) To ascertain the effect of the differences in
the number of genomes we constructed replicates with random subsamples of
genomes of varying numbers. Extrapolation of the results suggests that
T/(T + D) for Fungi is as large as or larger than for Cyanobacteria (for
details see text). Fungi are shown with squares for tree A (red online) and
circles for tree B (orange online), and Cyanobacteria with triangles (green
online).

One feature of genome evolution in Prokaryotes that has received considerable
attention is the concept of highways of gene transfers [221 08|. According to
this model, some pairs of species or clades have exchanged large numbers of
genes throughout their history, possibly because of a shared ecological niche.
ALEml_undated inferences provide us with an opportunity to look for such
highways in Cyanobacteria and Fungi. Fig. 6 shows the distribution of the
number of transfers per pair of branches of the species tree in both
Cyanobacteria and Fungi. Both distributions show a long tail, with many
transfers occurring between branches that otherwise have exchanged little
genetic material. However, some pairs of branches show very high numbers of
gene transfer events. The heterogeneity is strongest in Fungi, where some pairs
of branches are predicted to have undergone more than 150 gene transfers, or
even 300 transfers on tree B (Fig. 7). These transfers do not seem to be due to
hybridization, as most of them are not replacement transfers, whereby a gene in
a species is replaced by another gene coming from another species (the median
branch-wise fraction of gene transfers that are compensated by loss on the same
branch, i.e. replacement transfers, is 31% for Fungi on tree A, 34% on tree B
and 49% for Cyanobacteria). For the same reason, these transfers cannot be
misinterpreted events of incomplete lineage sorting. In fact, among genes that
have only one ortholog per species, genes that have undergone a gene transfer
tend to change position on the chromosome more often than genes that have not
undergone a gene transfer (see Fig. 7, right, for the Aspergillus clade). The
pairs of branches with the largest numbers of transfers belong to the
Aspergillus clade, in agreement with the overall larger amount of transfers
detected in this clade and in agreement with previous reports [38]. The species
involved in the largest number of transfers in either tree A or tree B is
Aspergillus nidulans, precisely the species whose position is contentious. This
suggests that lateral gene transfers in Fungi may be significant enough to make
reconstruction of the species phylogeny difficult. Although deeper sampling
could change the numbers of gene transfers found on each pair of branches, for
instance by breaking branches involved in a highway, it seems unlikely that the
conclusion that there are branch pairs or groups of branches exchanging large
numbers of genes in Fungi would change.
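
The two summaries used in this section, the ranking of branch pairs by number of transfers and the median branch-wise fraction of replacement transfers, can be sketched as below. This is a minimal illustration, not the pipeline used here. It assumes that transfer events are available as (donor branch, recipient branch) tuples and that a highway is an unordered pair of branches. The function names, branch labels and counts are all hypothetical.

```python
from collections import Counter
from statistics import median


def rank_highways(transfer_events):
    """Count transfers per unordered pair of branches, largest first.

    transfer_events: iterable of (donor_branch, recipient_branch) tuples.
    Direction is ignored here, which is an assumption of this sketch.
    """
    counts = Counter(tuple(sorted(pair)) for pair in transfer_events)
    return counts.most_common()


def median_replacement_fraction(per_branch):
    """Median over branches of (replacement transfers / incoming transfers).

    per_branch: dict mapping a branch to (incoming_transfers, replaced),
    where `replaced` counts transfers compensated by a loss on the same
    branch. Branches without incoming transfers are skipped.
    """
    fractions = [replaced / incoming
                 for incoming, replaced in per_branch.values()
                 if incoming > 0]
    return median(fractions)


# Hypothetical events and counts, for illustration only.
events = [("n1", "n2"), ("n2", "n1"), ("n1", "n3"), ("n2", "n1"), ("n4", "n3")]
for pair, count in rank_highways(events):
    print(pair, count)

branch_counts = {"n1": (10, 3), "n2": (20, 10), "n3": (4, 1), "n4": (0, 0)}
print(median_replacement_fraction(branch_counts))
```

Plotting the counts returned by `rank_highways` in decreasing order gives the kind of curve shown in Fig. 6.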

3.6 Genes tend to be transferred together 

Figure 6: Stronger highways in Fungi. Data points, dark gray (red online)
for tree B and light gray (orange online) for tree A, correspond to numbers
of transfers between pairs of branches ("highways of transfers") in either
Fungi phylogeny, plotted in decreasing order. The continuous line (green
online) shows the mean number of transfers between pairs of branches among 25
replicates where a random set of 28 cyanobacterial genomes was chosen as in
Fig. 5B. The shaded area shows the 95% confidence interval. Fungi are in dark
grey (red online) and Cyanobacteria in light grey (green online) throughout.

The distribution of transferred genes along chromosomes appears to be
consistent with the transfer of chromosomal segments that can include more than
one gene. Counting only transfers to the terminal branches, transferred genes
appear preferentially next to another transferred gene: on average, 4.7 times
more often in Fungi on tree A (6.1 times more often on tree B), and 5.3 times
more often in Cyanobacteria. Given that genes transferred on terminal branches
make up a minority of the genomes, this means that transfers tend to affect
blocks of several genes at a time.
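
One way to quantify the tendency of transferred genes to sit next to each other is a fold enrichment of adjacent transferred pairs over the expectation under random gene order. The sketch below is an assumption about how such a statistic could be defined, not the measure used in this study. It treats a single linear chromosome as an ordered list of flags, and the function name and example data are hypothetical.

```python
def tandem_enrichment(is_transferred):
    """Fold enrichment of adjacent gene pairs in which both were transferred.

    is_transferred: list of booleans in chromosomal order (one linear
    chromosome). The expectation is the probability that two genes drawn
    without replacement are both transferred, k(k-1) / (n(n-1)), which is
    what a random shuffling of gene order would give for any adjacent pair.
    """
    n = len(is_transferred)
    k = sum(1 for flag in is_transferred if flag)
    if n < 2 or k < 2:
        raise ValueError("need at least two genes and two transferred genes")
    both = sum(1 for left, right in zip(is_transferred, is_transferred[1:])
               if left and right)
    observed = both / (n - 1)
    expected = k * (k - 1) / (n * (n - 1))
    return observed / expected


# Hypothetical chromosomes with 3 transferred genes out of 10.
clustered = [True, True, True] + [False] * 7
scattered = [True, False, False, True, False, False, True, False, False, False]

print(tandem_enrichment(clustered))
print(tandem_enrichment(scattered))
```

Values above 1 indicate that transferred genes are neighbors more often than a random arrangement would produce, and values below 1 indicate the opposite.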

3.7 ALEml_undated reconstructs accurate gene trees

In [23], we found using realistic simulations that amalgamation of gene trees
using a DTL model produced accurate gene trees: the number of duplications and
transfers needed to reconcile our reconstructed trees was statistically
indistinguishable from the corresponding number of events needed to reconcile
the “real” trees that had been used to simulate gene alignments. In the
present study, empirical results also show that gene trees reconstructed by
ALEml_undated are accurate. First, the fact that the reconstructions of
ancestral genome sizes based on our reconciled gene trees are not significantly
different from extant genome sizes suggests that our gene trees do not contain
large numbers of incorrect bipartitions. Second, the overrepresentation of
transferred genes in tandem cannot be explained by random errors in gene trees,
but shows that bona fide information can be retrieved from gene trees
reconstructed by ALEml_undated.

4 Conclusion 

Figure 7: Position of the 5 largest highways in Fungi on tree A. Branch
lengths are in expected numbers of amino-acid substitutions, and colored
according to the number of incoming gene transfers and originations. The five
largest highways of transfers are represented. On the right, boxplots show
pairwise genome synteny comparisons. Transferred genes are found to change
neighbors more often than non-transferred genes. On tree B, the five largest
highways concern 329, 244, 177, 174 and 168 genes, the three largest of which
involve Aspergillus nidulans.

Our genome-scale phylogenetic analysis of genome evolution in Cyanobacteria
and Fungi shows that Fungi exhibit rates of transfer similar to those of
Cyanobacteria, and display apparent highways of gene transfers. Whether these
highways of gene transfers correspond to shared ecological niches or to
particular mechanisms to incorporate foreign DNA remains to be investigated.
In both clades, gene transfers appear to occur in blocks, not just one gene at
a time. Further investigation of those transferred blocks of genes may prove
useful for functional annotation, as co-transferred genes may be functionally
related.

This study also allows the comparative study of different methodologies for 
reconstructing genome evolution. We show that the recent developments provide 
a framework adapted to different domains of life, and that gene tree-aware 
methods show more precision in the quantification of gene transfers. 

Our results suggest that further analyses of data sets for other clades of
Prokaryotes and Eukaryotes, using gene tree-aware approaches, will provide a
more fine-grained, quantitative view of the dynamics of genome evolution across
the tree of life.